Data manipulation

tidyverse, importing data

Gerko Vink

Methodology & Statistics @ Utrecht University

Statistical programming with R team

Summerschool @ Utrecht University

5 Jun 2025

Disclaimer

I owe a debt of gratitude to many people as the thoughts and code in these slides are the process of years-long development cycles and discussions with my team, friends, colleagues and peers. When someone has contributed to the content of the slides, I have credited their authorship.

These materials are generated by Gerko Vink, who holds the copyright. The intellectual property belongs to Utrecht University. Images are either directly linked, or generated with StableDiffusion or DALL-E. That said, there is no information in this presentation that exceeds legal use of copyright materials in academic settings, or that should not be part of the public domain.

Warning

You may use any and all content in this presentation - including my name - and submit it as input to generative AI tools, with the following exception:

  • You must ensure that the content is not used for further training of the model

Slide materials and source code

Materials

Recap

Yesterday we have learned:

Today

Today we will learn how to:

Topics of this lecture

The tidyverse packages

  • dplyr package
  • pipe operator %>%
  • standard solves for missing data

Importing data

The tidyverse packages

tidyverse and the data analysis cycle

Tidyverse and the verbs of data manipulation

Leading principle: language of programming should really behave like a language, tidyverse.


tidyverse: a few key verb that perform common types of data manipulation.

Tidy data

The tidyverse packages operate on tidy data:

  1. Each column is a variable

  2. Each row is an observation

  3. Each cell is a single value


Untidy versus tidy data

The dplyr package

Data manipulation with dplyr

The dplyr package is a specialized package for working with data.frames (and the related tibble) to transform and summarize tabular data:

  • summary statistics for grouped data
  • selecting variables
  • filtering cases
  • (re)arranging cases
  • computing new variables
  • recoding variables

dplyr cheatsheet

Why dplyr?

  • Targeted for data analysis with data frames (tibbles)
  • More intuitive and easier than base R functions
  • No need for working with square brackets


In combination with the pipe operator %>%:

  • performs a series of operations step-by-step
  • no need to write code inside-out
  • no need to make intermediate objects

dplyr functions in this lecture

There are many functions available in dplyr, but we will focus on just the following dplyr functions (verbs):

dplyr verbs Description
glimpse() a transposed print of the data that shows all variables
select() selects variables (columns) based on their names
filter() subsets the rows of a data frame based on their values
arrange() re-order or arrange rows
mutate() adds new variables, or new variables that are functions of existing variables
summarise() creates a new data frame with statistics of the variables (optional grouped by another variables)
group_by() allows for group operations in the “split-apply-combine” concept

Check the dplyr cheat sheet for examples.

dplyr::glimpse()

  • Prints a transposed version of the data: variables are the rows, observations are the columns.
  • Makes it possible to see every column in a data frame.
  • It is similar to str(), but shows more data.
  • str() shows more detailed information about data structure.
dplyr::glimpse(planets)
Rows: 8
Columns: 4
$ planet_type <fct> Terrestrial planet, Terrestrial planet, Terrestrial planet…
$ diameter    <dbl> 0.382, 0.949, 1.000, 0.532, 11.209, 9.449, 4.007, 3.883
$ rotation    <dbl> 58.64, -243.02, 1.00, 1.03, 0.41, 0.43, -0.72, 0.67
$ rings       <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE
str(planets)
'data.frame':   8 obs. of  4 variables:
 $ planet_type: Factor w/ 2 levels "Gas giant","Terrestrial planet": 2 2 2 2 1 1 1 1
 $ diameter   : num  0.382 0.949 1 0.532 11.209 ...
 $ rotation   : num  58.64 -243.02 1 1.03 0.41 ...
 $ rings      : logi  FALSE FALSE FALSE FALSE TRUE TRUE ...

Select columns with dplyr::select()

Select variables type and diameter from the planets data frame:

dplyr::select(planets, planet_type, diameter)
               planet_type diameter
Mercury Terrestrial planet    0.382
Venus   Terrestrial planet    0.949
Earth   Terrestrial planet    1.000
Mars    Terrestrial planet    0.532
Jupiter          Gas giant   11.209
Saturn           Gas giant    9.449
Uranus           Gas giant    4.007
Neptune          Gas giant    3.883

Select columns with dplyr::select()

Select numerical variables with where(is.numeric):

dplyr::select(planets, where(is.numeric))
        diameter rotation
Mercury    0.382    58.64
Venus      0.949  -243.02
Earth      1.000     1.00
Mars       0.532     1.03
Jupiter   11.209     0.41
Saturn     9.449     0.43
Uranus     4.007    -0.72
Neptune    3.883     0.67

Select rows with dplyr::filter()

Selects subsets of the rows of a data frame based on their values.

Select the planets that have a ring and that are gas giants:

dplyr::filter(planets, rings == TRUE, planet_type == "Gas giant")
        planet_type diameter rotation rings
Jupiter   Gas giant   11.209     0.41  TRUE
Saturn    Gas giant    9.449     0.43  TRUE
Uranus    Gas giant    4.007    -0.72  TRUE
Neptune   Gas giant    3.883     0.67  TRUE

The pipe operator %>%

Shortcut key: Ctrl/Cmd + Shift + M

How %>% works

Passes (transformed) data on to the next operation

  • avoids nested code

  • avoids creation of intermediate objects


Basic principle of %>% operator

data %>% 
  
  perform operation A and pass on transformed data %>% 
  
  perform operation B and pass on transformed data %>% 
  
  etc.

dplyr::select() with the pipe operator

dplyr::select without %>%

dplyr::select(planets, planet_type, diameter)
               planet_type diameter
Mercury Terrestrial planet    0.382
Venus   Terrestrial planet    0.949
Earth   Terrestrial planet    1.000
Mars    Terrestrial planet    0.532
Jupiter          Gas giant   11.209
Saturn           Gas giant    9.449
Uranus           Gas giant    4.007
Neptune          Gas giant    3.883

Straightforward and might be more familiar to those used to base R functions.

dplyr::select with %>%

planets %>%
  dplyr::select(planet_type, diameter)
               planet_type diameter
Mercury Terrestrial planet    0.382
Venus   Terrestrial planet    0.949
Earth   Terrestrial planet    1.000
Mars    Terrestrial planet    0.532
Jupiter          Gas giant   11.209
Saturn           Gas giant    9.449
Uranus           Gas giant    4.007
Neptune          Gas giant    3.883

More readable and flexible when chaining multiple dplyr functions.

Compute new variables with dplyr::mutate()

dplyr::mutate() adds a new variable to the data frame.

data %>% 
  dplyr::mutate(..., .keep = c("all", ...), .before = NULL, .after = NULL)


Arguments:

.keep specifies which variables to return, “all”, “used”, “unused”, “none”.

.before or .after determine where the new variables are inserted.

Compute new variables with dplyr::mutate()

data %>% 
  dplyr::mutate(..., .keep = c("all", ...), .before = NULL, .after = NULL)


Example: compute a new variable rotation_diameter = rotation/diameter, add it to the data frame and keep all other variables:

planets %>% 
  dplyr::mutate(rotation_diameter = rotation/diameter, .keep = "all") %>%
  glimpse()
Rows: 8
Columns: 5
$ planet_type       <fct> Terrestrial planet, Terrestrial planet, Terrestrial …
$ diameter          <dbl> 0.382, 0.949, 1.000, 0.532, 11.209, 9.449, 4.007, 3.…
$ rotation          <dbl> 58.64, -243.02, 1.00, 1.03, 0.41, 0.43, -0.72, 0.67
$ rings             <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE
$ rotation_diameter <dbl> 153.50785340, -256.08008430, 1.00000000, 1.93609023,…

Temporary / permanent changes

The pipe operations do not make changes to the original data set, unless you save the results:

Temporary:

planets %>% 
  dplyr::mutate(rotation_diameter = 
                  rotation/diameter)
names(planets)
[1] "planet_type" "diameter"    "rotation"    "rings"      

Changes saved in new data frame:

new_data_set <- planets %>% 
  dplyr::mutate(rotation_diameter = 
                  rotation/diameter) 
names(new_data_set)
[1] "planet_type"       "diameter"          "rotation"         
[4] "rings"             "rotation_diameter"

Re-order rows with dplyr::arrange()

Order the rows of the planets data set on ascending values of diameter:


Original data set:

planets
               planet_type diameter rotation rings
Mercury Terrestrial planet    0.382    58.64 FALSE
Venus   Terrestrial planet    0.949  -243.02 FALSE
Earth   Terrestrial planet    1.000     1.00 FALSE
Mars    Terrestrial planet    0.532     1.03 FALSE
Jupiter          Gas giant   11.209     0.41  TRUE
Saturn           Gas giant    9.449     0.43  TRUE
Uranus           Gas giant    4.007    -0.72  TRUE
Neptune          Gas giant    3.883     0.67  TRUE

Ordered data set, based on diameter:

planets %>% 
  dplyr::arrange(diameter)
               planet_type diameter rotation rings
Mercury Terrestrial planet    0.382    58.64 FALSE
Mars    Terrestrial planet    0.532     1.03 FALSE
Venus   Terrestrial planet    0.949  -243.02 FALSE
Earth   Terrestrial planet    1.000     1.00 FALSE
Neptune          Gas giant    3.883     0.67  TRUE
Uranus           Gas giant    4.007    -0.72  TRUE
Saturn           Gas giant    9.449     0.43  TRUE
Jupiter          Gas giant   11.209     0.41  TRUE

Multiple transformations: base R and dplyr

Suppose we want to perform the following transformations:

  1. Sort the rows of planets on ascending values of rotation
  2. Select only planets with diameter > 1
  3. Display the variables planet_type, diameter and rotation

With base R code:

subset(planets[order(planets$rotation), ],  
       subset = diameter > 1, 
       select = c(planet_type, diameter, 
                  rotation))
        planet_type diameter rotation
Uranus    Gas giant    4.007    -0.72
Jupiter   Gas giant   11.209     0.41
Saturn    Gas giant    9.449     0.43
Neptune   Gas giant    3.883     0.67

With dplyr and the pipe %>% operator

planets %>% 
  dplyr::filter(diameter > 1) %>% 
  dplyr::arrange(rotation) %>% 
  dplyr::select(planet_type, diameter, rotation)
        planet_type diameter rotation
Uranus    Gas giant    4.007    -0.72
Jupiter   Gas giant   11.209     0.41
Saturn    Gas giant    9.449     0.43
Neptune   Gas giant    3.883     0.67

Summary statistics with summarise()

The dplyr function for summarizing data:

planets %>% 
  dplyr::summarise(
    mean_diameter = mean(diameter), 
    sd_diameter = sd(diameter)
  )
  mean_diameter sd_diameter
1      3.926375    4.226738
  • Various summary function(s):
    • mean(), median(), sd(), var(), sum(), for numeric variables
    • n(), n_distinct() for counts
    • many others, see: ?dplyr::select and cheat sheet)

Summaries for groups with group_by()

The dplyr function for grouping rows of a data frame is very useful in combination with summarise()

Example: group the planets based on having rings (or not) and compute the mean and the standard deviation for each group.

planets %>% 
  dplyr::group_by(rings) %>%
  dplyr::summarise(
    mean_diameter = mean(diameter), 
    sd_diameter = sd(diameter)
  )
# A tibble: 2 × 3
  rings mean_diameter sd_diameter
  <lgl>         <dbl>       <dbl>
1 FALSE         0.716       0.306
2 TRUE          7.14        3.76 

Pipes and the R model formula ~

Sometimes the data we want to use, are “piped” in the wrong argument, see e.g.:

planets %>% 
  lm(diameter ~ rotation)

This leads to an error:

Error in as.data.frame.default(data) :

cannot coerce class ‘“formula”’ to a data.frame

The %>% pipe operator works with functions where the first argument of the next function in the pipeline is a data frame.

For the function lm() the first argument is the model formula (a text object):

lm(x ~ y, data = dataframe)

Solution: use placeholder .

We can use the . symbol to act as placeholder for the data:

planets %>% 
  lm(diameter ~ rotation, data = .)

Call:
lm(formula = diameter ~ rotation, data = .)

Coefficients:
(Intercept)     rotation  
   4.126411     0.008814  

Standard solves for missing values

Dealing with missing values in R

Calculations based on missing values (NA’s) are not possible in R:

variable <- c(1, 2, NA, 4, 5)
mean(variable)
[1] NA

There are two easy ways to perform “listwise deletion”:

mean(variable, na.rm = TRUE)
[1] 3
mean(na.omit(variable))
[1] 3

Dealing with missing values with dplyr

df$score
[1]  1  2 NA  4  5

No solution for missing values:

df %>% 
  dplyr::summarise(
    mean_variable = mean(score), 
    sd_variable = sd(score)
  )
  mean_variable sd_variable
1            NA          NA

Use na.rm = TRUE:

df %>% 
  dplyr::summarise(
    mean_variable = mean(score, na.rm = TRUE), 
    sd_variable = sd(score, na.rm = TRUE)
  )
  mean_variable sd_variable
1             3    1.825742

Style guide for coding pipes

Code with a single pipe operator on one line and spaces around %>%:

data %>% dplyr::select(X)

Code with multiple pipe operators on multiple lines:

data %>% 
  dplyr::group_by(X) %>% 
  dplyr::filter(Y > 4) %>% 
  dplyr::summarise(mean(Y))

but definitely NOT:

data%>%dplyr::group_by(X)%>%dplyr::filter(Y>4)%>%dplyr::summarise(mean(Y))

More about coding style: tidyverse style guide

https://style.tidyverse.org/index.html

Importing data into R

R data format and workspace: .RData

The R data file format is .RData.

In RStudio your data and the alterations on the data, are saved in the workspace.

  • A workspace contains all changes you made to your data and functions during a session.
  • Workspaces are compressed and require relatively little memory when stored. The compression is very efficient and beats reloading large data sets from raw text.

You can save a data frame saved in your workspace with:

save(name_data_frame, "name_file.RData")

To open the data use:

load("name_file.Rdata")

Note: This code works if you have placed the .RData file in the same project folder as your Rmd file. You do not have to specify a file path.

Data sets in R (packages)

R has many in-built data sets. The command data() will give a list of all in-built data sets (also the data included in the non-base packages that are activated).

Open an in-built data set as follows:

require(MASS) # load the package MASS that contains the mammals data.
data(mammals) # load the mammals data

Importing delimited data files

Text files (.txt) can be imported in to R with:

read.table("mammalsleep.txt")

CSV (comma seperated values) files can be imported with the readr package from tidyverse:

read_csv("filename.csv")

Read and write statistical data formats

There are many packages that facilitate importing/exporting other data formats from statistical software:

  • SPSS: the function read_spss from package haven (but also other data formats from Stata and SAS)
  • Mplus: package MplusAutomation
  • Stata: read.dta() in foreign
  • SAS: sasxport.get() from package Hmisc
  • MS Excel:
    • function read.xlsx() from package openxlsx
    • function read_excel() from package readxl

haven by Hadley Wickham provides wonderful functions to import and export many data types from software such as Stata, SAS and SPSS.

For a short guideline to import multiple formats into R, see e.g. http://www.statmethods.net/input/importingdata.html.

Linear regression

Content

  1. The linear model lm()

  2. Simple regression

  3. Multiple regression

Linear Model

The linear regression model:

\[y_i=\beta_0+\sum_{j}\beta_{j} x_{ij}+\varepsilon_i, \ \ \ \ \ \ \varepsilon_i\sim N(0, \sigma^2)\] where

  • \(y_i\) is score of individual \(i\) on the numeric dependent variable \(Y\)

  • \(x_{ij}\) is the score of individual \(i\) on predictor \(X_j\)

  • \(\beta_0\) is the intercept

  • \(\beta_j\) is the slope of \(X_j\)

  • \(\varepsilon_{i}\) is the residual (prediction error)

The lm() function

lm(formula, data) # returns an object of class lm
formula model
y ~ 1 intercept-only
y ~ x1 main effect of x1
y ~ x1 + x2 main effects of x1, x2
y ~ . main effects of all predictors
y ~ . - x1 main effects of all predictors except x1
y ~ x1 + x2 + x1:x2 main effects + interaction between x1 and x2
y ~ x1*x2 idem
y ~ .^2 main effects + pairwise interactions between all predictors

Simple regression

Continuous predictor

  • observed = fitted + residual

\[y_i=\hat{y}_i+\varepsilon_i\] - fitted = intercept + slope times x

\[\hat{y}_i=\beta_0 + \beta_1x_i\]


  • fitted changes with \(\beta_1\) for each unit increase in \(x\)

Regression line for mpg ~ disp

ggplot(mtcars, aes(disp, mpg)) + 
  geom_point() +
  xlim(0, 500) +
  geom_smooth(method = "lm", se = FALSE, fullrange=T)

Coefficients

Interpretation the parameter estimates.

(fit <- lm(mpg ~ disp, mtcars))

Call:
lm(formula = mpg ~ disp, data = mtcars)

Coefficients:
(Intercept)         disp  
   29.59985     -0.04122  

Structure of lm object

str(fit)
List of 12
 $ coefficients : Named num [1:2] 29.5999 -0.0412
  ..- attr(*, "names")= chr [1:2] "(Intercept)" "disp"
 $ residuals    : Named num [1:32] -2.01 -2.01 -2.35 2.43 3.94 ...
  ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
 $ effects      : Named num [1:32] -113.65 -28.44 -1.79 2.65 3.92 ...
  ..- attr(*, "names")= chr [1:32] "(Intercept)" "disp" "" "" ...
 $ rank         : int 2
 $ fitted.values: Named num [1:32] 23 23 25.1 19 14.8 ...
  ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
 $ assign       : int [1:2] 0 1
 $ qr           :List of 5
  ..$ qr   : num [1:32, 1:2] -5.657 0.177 0.177 0.177 0.177 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
  .. .. ..$ : chr [1:2] "(Intercept)" "disp"
  .. ..- attr(*, "assign")= int [1:2] 0 1
  ..$ qraux: num [1:2] 1.18 1.09
  ..$ pivot: int [1:2] 1 2
  ..$ tol  : num 1e-07
  ..$ rank : int 2
  ..- attr(*, "class")= chr "qr"
 $ df.residual  : int 30
 $ xlevels      : Named list()
 $ call         : language lm(formula = mpg ~ disp, data = mtcars)
 $ terms        :Classes 'terms', 'formula'  language mpg ~ disp
  .. ..- attr(*, "variables")= language list(mpg, disp)
  .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:2] "mpg" "disp"
  .. .. .. ..$ : chr "disp"
  .. ..- attr(*, "term.labels")= chr "disp"
  .. ..- attr(*, "order")= int 1
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. ..- attr(*, "predvars")= language list(mpg, disp)
  .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. ..- attr(*, "names")= chr [1:2] "mpg" "disp"
 $ model        :'data.frame':  32 obs. of  2 variables:
  ..$ mpg : num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
  ..$ disp: num [1:32] 160 160 108 258 360 ...
  ..- attr(*, "terms")=Classes 'terms', 'formula'  language mpg ~ disp
  .. .. ..- attr(*, "variables")= language list(mpg, disp)
  .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:2] "mpg" "disp"
  .. .. .. .. ..$ : chr "disp"
  .. .. ..- attr(*, "term.labels")= chr "disp"
  .. .. ..- attr(*, "order")= int 1
  .. .. ..- attr(*, "intercept")= int 1
  .. .. ..- attr(*, "response")= int 1
  .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. .. ..- attr(*, "predvars")= language list(mpg, disp)
  .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. .. ..- attr(*, "names")= chr [1:2] "mpg" "disp"
 - attr(*, "class")= chr "lm"

Summary of lm object

summary(fit)

Call:
lm(formula = mpg ~ disp, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.8922 -2.2022 -0.9631  1.6272  7.2305 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 29.599855   1.229720  24.070  < 2e-16 ***
disp        -0.041215   0.004712  -8.747 9.38e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.251 on 30 degrees of freedom
Multiple R-squared:  0.7183,    Adjusted R-squared:  0.709 
F-statistic: 76.51 on 1 and 30 DF,  p-value: 9.38e-10

Structure summary lm object

str(summary(fit))
List of 11
 $ call         : language lm(formula = mpg ~ disp, data = mtcars)
 $ terms        :Classes 'terms', 'formula'  language mpg ~ disp
  .. ..- attr(*, "variables")= language list(mpg, disp)
  .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:2] "mpg" "disp"
  .. .. .. ..$ : chr "disp"
  .. ..- attr(*, "term.labels")= chr "disp"
  .. ..- attr(*, "order")= int 1
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. ..- attr(*, "predvars")= language list(mpg, disp)
  .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. ..- attr(*, "names")= chr [1:2] "mpg" "disp"
 $ residuals    : Named num [1:32] -2.01 -2.01 -2.35 2.43 3.94 ...
  ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
 $ coefficients : num [1:2, 1:4] 29.59985 -0.04122 1.22972 0.00471 24.07041 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:2] "(Intercept)" "disp"
  .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
 $ aliased      : Named logi [1:2] FALSE FALSE
  ..- attr(*, "names")= chr [1:2] "(Intercept)" "disp"
 $ sigma        : num 3.25
 $ df           : int [1:3] 2 30 2
 $ r.squared    : num 0.718
 $ adj.r.squared: num 0.709
 $ fstatistic   : Named num [1:3] 76.5 1 30
  ..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
 $ cov.unscaled : num [1:2, 1:2] 1.43e-01 -4.85e-04 -4.85e-04 2.10e-06
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:2] "(Intercept)" "disp"
  .. ..$ : chr [1:2] "(Intercept)" "disp"
 - attr(*, "class")= chr "summary.lm"

Extracting lm list elements:

Function / Subsetting Output
coef(fit) / fit$coef coefficients
fitted(fit) / fit$fitted fitted values
resid(fit) / fit$resid residuals
summary(fit)$r.squared R-squared statistic
fit$coef
(Intercept)        disp 
29.59985476 -0.04121512 
summary(fit)$r.squared
[1] 0.7183433

Categorical predictor

Dummy variables

Categorical predictors are converted into dummy variables:

  • each category has a dummy with value 1 for that category, and 0 otherwise

  • except for the reference category (0 on all dummies)

  • all categories are compared to the reference category

Reference category of \(z\) is \(a\)

Interpreting dummies

Model for categorical \(Z\) with categories \(a, b, c\):

\[\hat{y}=\beta_0+\beta_1zb+\beta_2zc\]


parameters interpretation
\(\beta_0\) predicted mean category \(a\) (reference category)
\(\beta_0+\beta_1\) predicted mean category \(b\)
\(\beta_0+\beta_2\) predicted mean category \(c\)

Example

Interpret the parameter estimates of model mpg ~ factor(am)

  • am = 0 is automatic and am = 1 is manual transmission

  • reference category is am = 0

coef(lm(mpg ~ factor(am), mtcars))
(Intercept) factor(am)1 
  17.147368    7.244939 

Regression line

ggplot(mtcars, aes(am, mpg)) + 
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

Predictor gear with 3 categories

Works the same as above, but one more dummy needed

  • linear model with 2 dummies

  • reference category gear = 3

lm(mpg ~ factor(gear), mtcars)

Call:
lm(formula = mpg ~ factor(gear), data = mtcars)

Coefficients:
  (Intercept)  factor(gear)4  factor(gear)5  
       16.107          8.427          5.273  

Regression lines

Two regression lines

  • one from reference 3 gears to 4 gears

  • one from reference 3 gears to 5 gears

Factor or not?

The model mpg ~ gear interprets the number of gears as numeric

  • only one slope is estimated

Model comparisons with anova()

Which model fits better:

  1. mpg ~ gear

  2. mpg ~ factor(gear)


Model 2 has one more parameter, so it is expected to fit better.

  • But is it significantly better?


We can check this with an \(F\)-test for the \(R^2\)-change

  • The function to do this test is anova(model 1, model 2)

ANOVA

Compare the fit of the models:

  • mpg ~ 1
  • mpg ~ gear
  • mpg ~ factor(gear)
anova(lm(mpg ~ 1, mtcars),
      lm(mpg ~ gear, mtcars),
      lm(mpg ~ factor(gear), mtcars))
Analysis of Variance Table

Model 1: mpg ~ 1
Model 2: mpg ~ gear
Model 3: mpg ~ factor(gear)
  Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
1     31 1126.0                                
2     30  866.3  1    259.75 11.719 0.001864 **
3     29  642.8  1    223.49 10.083 0.003534 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Multiple regression

Main-effects model

No interactions in the model, e.g.

\[\hat{y}=\beta_0+\beta_1x_1+\beta_2x_2\]

Interpretation

  • the slope of \(x_1\) does not depend on \(x_2\), and vice versa

  • the slopes of \(x_1\) and \(x_2\) are corrected for the correlation \(r_{x_1,x_2}\)

Model with disp and gear

lm(mpg ~ disp, mtcars)$coef %>% round(3)

lm(mpg ~ gear, mtcars)$coef %>% round(3)

lm(mpg ~ disp + gear, mtcars)$coef %>% round(3)
(Intercept)        disp 
     29.600      -0.041 
(Intercept)        gear 
      5.623       3.923 
(Intercept)        disp        gear 
     29.105      -0.041       0.111 
  • The slope of gear almost disappeared in the last model.

Interactions

The slope of one predictor depends on the value of the other

lm(mpg ~ disp * factor(am), mtcars) 

Call:
lm(formula = mpg ~ disp * factor(am), data = mtcars)

Coefficients:
     (Intercept)              disp       factor(am)1  disp:factor(am)1  
        25.15706          -0.02758           7.70907          -0.03145  
  • The slope of disp for am = 0 is -0.2758

  • The slope of disp for am = 1 is -0.2758 - 0.03145 = -0.30725

Visualization of the interaction model

There is obviously an interaction

ggplot(mtcars, aes(disp, mpg, col = factor(am))) + 
  geom_point() + 
  geom_smooth(method = "lm", se=F)

Comparison coefficients

           rowname main.effects interaction
1      (Intercept)  27.84808111 25.15706407
2             disp  -0.03685086 -0.02758360
3      factor(am)1   1.83345825  7.70907298
4 disp:factor(am)1           NA -0.03145482

Compare the plots

  • Main effects model: same slopes, different intercepts (not visible)

  • Interaction model: different slopes and different intercepts

ANOVA

Compare the fit of the main effects and interaction model

anova(lm(mpg ~ disp + factor(gear), mtcars),
      lm(mpg ~ disp * factor(gear), mtcars))
Analysis of Variance Table

Model 1: mpg ~ disp + factor(gear)
Model 2: mpg ~ disp * factor(gear)
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1     28 317.01                                  
2     26 172.87  2    144.14 10.839 0.0003771 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Practical